Creating a New Synthetic Dataset

The Synthetic Data feature lets users generate customized datasets with large language models (LLMs) such as GPT-4. The process involves selecting input data from a file or database, configuring how the data is generated, and specifying filters and output formats.

This document outlines the steps involved in the creation process, from selecting the model to finalizing the dataset configuration.

Steps to Create a New Synthetic Dataset

1. Select Model

In the first step, users provide basic information about the synthetic dataset:

  • Name: Enter the name of the dataset.
  • Modeling Type: Select the LLM (e.g., GPT-4) to use for data generation.
  • Prompt: Optionally, enter a prompt to guide the model in data generation.

Once you've filled in these details, click Next to move to the configuration options.


2. Configure Input Source

The second step involves configuring the input source and adjusting how data will be generated. You can choose between two input types: File or Database.

File Input Configuration

For file-based input, the user has the following options:

  • Type: Choose whether to work with a JSON file or a text file, or to upload a new file.
  • Upload New: Drag and drop a file into the input area or select a file from your computer.
  • Generated Sample Size: Set the number of data points to generate.
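  • Temperature: Adjust the randomness of the model's sampling. Higher temperatures produce more varied outputs; lower temperatures produce more predictable ones.
  • Top P: Control the diversity of the generated content by restricting sampling to the most likely tokens whose combined probability reaches the chosen value. Lower values narrow the range of predicted words.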
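
To make the Temperature and Top P settings more concrete, here is a minimal sketch of how these two parameters are commonly applied during sampling. It is not this product's implementation: the function names and the scores in `logits` are made up for illustration, and the model behind the feature may handle these settings differently.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw token scores into probabilities after dividing by the temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical scores for four candidate tokens.
logits = [2.0, 1.0, 0.5, -1.0]

cool = apply_temperature(logits, 0.5)   # sharper distribution
warm = apply_temperature(logits, 1.5)   # flatter distribution
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.8)

print("T=0.5:", [round(p, 3) for p in cool])
print("T=1.5:", [round(p, 3) for p in warm])
print("top_p=0.8 keeps token indices:", sorted(nucleus))
```

In this sketch, a lower temperature concentrates probability on the top-scoring token, while a lower Top P drops the least likely tokens from consideration entirely.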

Database Input Configuration

For database-based input, the user can connect to a database:

  • Name and Host: Provide the name of the database and its host details.
  • Generated Sample Size: As with file input, set the number of data points to generate.
  • Temperature and Top P: These options work the same way as in file input, allowing users to control creativity and diversity.
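
If it helps to think of these settings as a single record, the sketch below models the database input options as plain Python dataclasses with a basic sanity check. Everything here is hypothetical: the class names, field names, example host, and validation ranges are assumptions for illustration, not the product's actual schema or limits.

```python
from dataclasses import dataclass

@dataclass
class GenerationSettings:
    """Hypothetical record of the generation options shared by both input types."""
    sample_size: int
    temperature: float
    top_p: float

    def validate(self):
        """Return a list of problems; an empty list means the values look sane.
        The ranges are general assumptions, not documented product limits."""
        errors = []
        if self.sample_size <= 0:
            errors.append("sample_size must be positive")
        if self.temperature < 0:
            errors.append("temperature must not be negative")
        if not 0 < self.top_p <= 1:
            errors.append("top_p must be in the range (0, 1]")
        return errors

@dataclass
class DatabaseInput:
    """Hypothetical record of the database connection fields."""
    name: str
    host: str
    settings: GenerationSettings

config = DatabaseInput(
    name="sales",                 # made-up database name
    host="db.example.internal",   # made-up host
    settings=GenerationSettings(sample_size=500, temperature=0.7, top_p=0.9),
)
print(config.settings.validate())
```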

After configuring the input, click Next to proceed to the column and filter selection.


3. Column Selection

In this step, you can specify which columns from the input dataset should be included in the generated synthetic data:

  • Select Table: Choose from the available tables, particularly when using database inputs.
  • Select Columns: From each table, select the relevant columns that should be part of the final dataset. You can select all columns or pick specific ones based on your needs.
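
Conceptually, column selection is a projection over the input rows. The sketch below shows the idea on a small in-memory table; the `select_columns` helper and the sample data are hypothetical and only illustrate the effect of picking specific columns versus keeping all of them.

```python
def select_columns(rows, columns=None):
    """Return copies of the rows restricted to `columns`.
    Passing None keeps every column."""
    if columns is None:
        return [dict(row) for row in rows]
    return [{name: row[name] for name in columns} for row in rows]

# Made-up input table.
customers = [
    {"id": 1, "name": "Ada", "country": "UK", "age": 36},
    {"id": 2, "name": "Lin", "country": "SG", "age": 29},
]

subset = select_columns(customers, ["name", "age"])
print(subset)
```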

4. Filter Selection

In the filter selection step, users can refine the dataset further:

  • Select Filter: Apply filters based on conditions like column values, ranges, and limits.
  • Custom Filter Conditions: Define custom filter conditions to tailor the dataset to your exact requirements. You can add multiple filters, each using a comparison operator such as equal to (=), greater than (>), or less than (<).
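
The filter conditions described above can be pictured as (column, operator, value) triples applied to each row. The following is a minimal sketch under that assumption; the `apply_filters` helper and the sample rows are hypothetical, and combining multiple filters with AND is an assumption, since the way the product combines them is not stated here.

```python
import operator

# Map the comparison symbols to Python's comparison functions.
OPERATORS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def apply_filters(rows, filters):
    """Keep the rows that satisfy every (column, op, value) condition."""
    return [
        row
        for row in rows
        if all(OPERATORS[op](row[column], value) for column, op, value in filters)
    ]

# Made-up input rows.
rows = [
    {"name": "Ada", "country": "UK", "age": 36},
    {"name": "Lin", "country": "SG", "age": 29},
    {"name": "Sam", "country": "UK", "age": 22},
]

result = apply_filters(rows, [("country", "=", "UK"), ("age", ">", 30)])
print(result)
```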

Once filters are applied, click Next to review the final summary.